import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.io as pio
pio.renderers.default = "iframe_connected"
config={'showLink': True, 'displayModeBar': True}
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import HTML, IFrame, display
We use ad data provided by Avazu, taken from the Kaggle Click-Through Rate Prediction competition. The original data is ~7GB in size, so we use pandas to import only a subset of it: the first 200000 rows.
df = pd.read_csv("data/train_subset.csv")
df.head()
Before analysing the data shown above, we need to understand what each column means.
df.shape
Our data has 200000 rows (records) and 24 columns (features).
df.dtypes
Our features can be broadly classified into the following categories:
df["hour"] = pd.to_datetime(df["hour"])
df.info()
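A note on the conversion above: if `hour` is still in the raw integer layout of the Kaggle release (assumed here to be `YYMMDDHH`, e.g. `14102100`), `pd.to_datetime` needs an explicit `format`, because bare integers are otherwise read as nanoseconds since the Unix epoch. A minimal sketch on toy values (illustrative only, not taken from our subset):

```python
import pandas as pd

# Toy values in the assumed raw YYMMDDHH layout (illustrative only)
raw_hour = pd.Series([14102100, 14102113, 14102523])

# Cast to string first so the format directives can be matched digit by digit
parsed_hour = pd.to_datetime(raw_hour.astype(str), format="%y%m%d%H")
print(parsed_hour)
```

If the subset was already saved with readable timestamps, the plain `pd.to_datetime(df["hour"])` call is enough.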
The first thing we usually check is whether there are NaN values. Our data has no null values: every column is fully populated. If we had null values, we could replace them column by column with the column mean, or infer them from other related columns.
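The null check and the mean replacement described above could be sketched as follows. The toy frame and its values are hypothetical, and filling a categorical column with its most frequent value is our own assumption rather than something the dataset requires:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values, not from the Avazu data)
toy = pd.DataFrame({
    "banner_pos": [0, 1, np.nan, 0],
    "device_type": ["1", None, "1", "4"],
})

# Count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Numeric column: fill with the column mean
# Categorical column: fill with the most frequent value (an assumption)
filled = toy.copy()
filled["banner_pos"] = filled["banner_pos"].fillna(filled["banner_pos"].mean())
filled["device_type"] = filled["device_type"].fillna(filled["device_type"].mode()[0])
print(filled)
```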
def count_unique(d, columns):
    """Print the number of distinct values in each of the given columns."""
    for column in columns:
        print("Number of Unique values in column {} is {}".format(column, len(d[column].unique())))
columns = list(df.columns)
count_unique(df, columns)
Here, we want to know the Click-Through Rate (CTR) of our dataset. CTR is defined as the ratio of the number of users who click an ad on a particular page to the total number of users who visit that page. A higher CTR indicates that more users were interested in the ads hosted on a particular website or page. Below we observe how many people in our data actually clicked the ad, then calculate the CTR to estimate how well our ads are performing.
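As a small worked example of the definition, assume each row is one ad impression with a 0/1 `click` flag, as in our data. The CTR is then the share of rows with `click == 1`, which is also the mean of the column. The values below are illustrative:

```python
import pandas as pd

# Toy click log: 1 = clicked, 0 = not clicked (illustrative values)
toy_clicks = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], name="click")

clicks = int(toy_clicks.sum())   # number of clicked impressions
impressions = len(toy_clicks)    # every row counts as one impression
ctr = clicks / impressions
print("CTR: {:.1%}".format(ctr))
```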
fig = px.histogram(df, x="click")
fig.update_layout(title="Click histogram")
fig.write_html("plots/click_histogram.html")
# plot(fig, filename = 'plots/click_histogram.html')
# display(HTML('plots/click_histogram.html'))
IFrame('plots/click_histogram.html', height=600, width=1000)
df["click"].value_counts()
CTR = len(df[df["click"] == 1]) / len(df)
print("Click-Through Rate (CTR): {}".format(str(CTR)))
From the above histogram and statistic, we can observe that 166132 of the 200000 records show no click, while only 33868 show a click. Our CTR is ~17%, which means that about 83% of the time the ad is not clicked at all!
Here we explore our datetime feature hour and observe how CTR varies across days and times of the day. This matters because it tells us what kind of ads to place and at what time of the day or week the traffic is most promising for profit. We know that online shopping is high on Black Friday and Cyber Monday, which fall in late November each year (Cyber Monday occasionally lands in early December). Likewise, since most people work during the day, we might assume that more people visit websites or click ads at night, or more specifically after official business hours end at around 5-6 pm. We also need to consider weekends and geographic location, both of which can significantly affect CTR. Other events, such as election periods and festivals, may also lead people to click on ads more. Hence, a datetime-based analysis is needed to understand the CTR trend.
df["hour"].describe()
click_day = df.groupby('hour').agg({'click':'sum'}).reset_index()
click_day.head()
fig = go.Figure(go.Scatter(name="clicks/day",
x = click_day['hour'],
y = click_day['click'],
hovertemplate='Date: %{x|%d %B %Y} <br>Time: %{x|%H:%M:%S} <br>Day: %{x|%A} <br>Clicks: %{y}'
))
fig.update_layout(
title = 'Trend of clicks grouped by day for all hours',
xaxis_tickformat = '%d %B <br>%Y',
xaxis_title = "Hourly clicks for 10 days",
yaxis_title = "Number of Clicks"
)
fig.show()
Above we have a plot of the number of clicks made every hour over the 10 days of data, between 21st October 2014 and 31st October 2014. We see peaks in the clicks on 22nd and 28th October around midday, and a surprising dip on the night of 25th October. Apart from these 3 outliers, the hourly click count looks fairly stationary, with almost the same trend on the remaining days.
Earlier we plotted the clicks made by users per hour against the date. Now, we would like to see how many clicks were made in each hour of the day across all the days. We sum the clicks made in the first hour of every day, then in the second hour of every day, and so on for all 24 hours, which form our X-axis. This shows how clicks vary over a typical day. We perform a small amount of feature engineering to build the plot.
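The feature engineering step amounts to dropping the date and keeping only the time label, then summing clicks per label. A minimal sketch on toy data, using the vectorised `.dt.strftime` accessor as one possible way to build the label (the rows are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "hour": pd.to_datetime(["2014-10-21 00:00", "2014-10-21 13:00",
                            "2014-10-22 13:00", "2014-10-23 00:00"]),
    "click": [0, 1, 1, 1],
})

# "HH:MM" label for each row, independent of the date
toy["hour_of_day"] = toy["hour"].dt.strftime("%H:%M")

# Sum clicks across all days for each hour label
clicks_per_hour = toy.groupby("hour_of_day")["click"].sum()
print(clicks_per_hour)
```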
df['hour_of_day'] = df["hour"].apply(lambda x: str(x.time())[:5])
click_hr = df.groupby('hour_of_day').agg({'click':'sum'}).reset_index()
click_hr.head()
fig = go.Figure(go.Scatter(name="clicks/hr",
x = click_hr['hour_of_day'],
y = click_hr['click'],
hovertemplate='Time: %{x} <br>Clicks: %{y}'
))
fig.update_layout(
title = 'Trend of clicks grouped by hours for all the days',
xaxis_title = "Hours",
yaxis_title = "Number of Clicks"
)
fig.show()
From the above trend, the most clicks are made between 12:00 pm and 2:00 pm. Fewer clicks are made during the early and late parts of the day. This means that people are more active during business hours than at the beginning or the end of the day! This matches the midday peaks in the earlier chart, and runs against our initial assumption that activity would be higher after business hours.
Impressions occur when an ad is rendered on a user's screen or on any other digital media platform. Impressions are not action-based: they are defined merely by a user potentially seeing the advertisement. Hence, it does not matter whether someone clicked the ad or not; an impression is counted whenever the ad is shown, with or without any action taken on it.
df.head()
We group our data first by hour and then by click, which gives us a multi-level grouping. Since we want all the data on one level, we unstack() it and plot a graph comparing clicks and non-clicks for every hour across all the days.
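To make the reshaping concrete, here is a minimal sketch on toy data (values are illustrative). `size()` yields one count per (hour, click) pair, and `unstack()` moves the inner `click` level into columns named 0 and 1. The `fill_value=0` argument is an optional addition that avoids NaN when an hour has no rows for one of the outcomes:

```python
import pandas as pd

toy = pd.DataFrame({
    "hour_of_day": ["00:00", "00:00", "00:00", "01:00", "01:00"],
    "click":       [0,       1,       0,       0,       0],
})

# Two-level index (hour_of_day, click) holding the row count of each pair
stacked = toy.groupby(["hour_of_day", "click"]).size()

# Move the inner level (click) into columns; missing pairs become 0
wide = stacked.unstack(fill_value=0).reset_index()
print(wide)
```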
impressions = df.groupby(['hour_of_day', 'click']).size().unstack().reset_index()
impressions.head()
fig = go.Figure(data=[
go.Bar(name='Clicked', x=impressions["hour_of_day"], y=impressions[1],
hovertemplate='Time: %{x} <br>Clicks: %{y}', marker_color='rgb(55, 83, 109)'),
go.Bar(name='Not Clicked', x=impressions["hour_of_day"], y=impressions[0],
hovertemplate='Time: %{x} <br>Not clicked: %{y}', marker_color='rgb(26, 118, 255)')
])
# Change the bar mode
fig.update_layout(
title = 'Hourly Impressions based on Clicks',
xaxis_title = "Hour of the day",
yaxis_title = "Impressions / hr",
barmode='group',
)
fig.show()
The above figure shows hourly impressions: in every hour, a large number of people saw the ads, but only a fraction of them actually clicked and were forwarded to a landing page.
Earlier we saw the hourly and daily clicks made on our ads. Now we would like to observe the hourly Click-Through Rate (CTR), which is the number of times the ad was clicked divided by the total impressions. We count how many times the ad was clicked in an hour and divide it by the total impressions of that hour, expressed as a percentage. This gives us the hourly Click-Through Rate.
# Count impressions (all rows) and clicks (sum of the 0/1 flag) per hour in one pass,
# so the two columns stay aligned even if some hour has no clicks
hourly_ctr = df.groupby("hour_of_day")["click"].agg(impressions="count", clicks="sum").reset_index()
hourly_ctr["CTR"] = hourly_ctr["clicks"] / hourly_ctr["impressions"] * 100
hourly_ctr.head()
fig = px.bar(hourly_ctr, x='hour_of_day', y='CTR',
labels={"hour_of_day": "Time"},
color='CTR',
height=400)
fig.update_layout(
title = 'Hourly CTR',
xaxis_title = "Hour of the day",
yaxis_title = "Click-Through Rate (CTR)",
)
fig.show()
Although the raw number of clicks is highest around the middle of the day, the CTR values show that the share of impressions that lead to a click is highest just after midnight, at around 1:00 am, with the second-highest peaks at 3:00 pm - 4:00 pm in the afternoon. Midnight has fewer impressions and fewer clicks than other times of the day; taken together, however, they produce an interestingly high CTR during the early hours of the day.
Now that we know how the CTR trends over each hour of the day, let's observe how it behaves for each day of the week. We will look at 3 things: the number of clicks per day, the impressions per day split into clicked and not clicked, and the daily CTR.
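The difference between click volume and click rate can be seen with simple arithmetic. In the sketch below the hourly totals are hypothetical, chosen only to illustrate how a quiet hour can have the higher CTR:

```python
import pandas as pd

# Hypothetical hourly totals, chosen only to illustrate the effect
toy = pd.DataFrame({
    "hour_of_day": ["01:00", "13:00"],
    "impressions": [1000, 10000],
    "clicks":      [200, 1600],
})
toy["CTR"] = toy["clicks"] / toy["impressions"] * 100

busiest = toy.loc[toy["clicks"].idxmax(), "hour_of_day"]   # most clicks
best_rate = toy.loc[toy["CTR"].idxmax(), "hour_of_day"]    # highest CTR
print(busiest, best_rate)
```

Here 13:00 has eight times as many clicks, yet 01:00 converts 20% of its impressions against 16%.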
df.head()
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df["day_of_week"] = df["hour"].apply(lambda x: days[x.weekday()])
click_days = df.groupby("day_of_week").agg({"click": "sum"}).reset_index()
click_days['day_of_week'] = pd.Categorical(click_days['day_of_week'], categories=days, ordered=True)
click_days = click_days.sort_values('day_of_week')
click_days.head(7)
fig = go.Figure(go.Scatter(name="clicks/day",
x = click_days['day_of_week'],
y = click_days['click'],
hovertemplate='Day: %{x} <br>Clicks: %{y}',
marker_color="darkolivegreen"
))
fig.update_layout(
title = 'Trend of clicks grouped by days',
xaxis_title = "Day of Week",
yaxis_title = "Number of Clicks"
)
fig.show()
impressions = df.groupby(['day_of_week', 'click']).size().unstack().reset_index()
impressions['day_of_week'] = pd.Categorical(impressions['day_of_week'], categories=days, ordered=True)
impressions = impressions.sort_values('day_of_week')
impressions.head(7)
fig = go.Figure(data=[
go.Bar(name='Clicked', x=impressions["day_of_week"], y=impressions[1],
hovertemplate='Day: %{x} <br>Clicks: %{y}', marker_color='indianred'),
go.Bar(name='Not Clicked', x=impressions["day_of_week"], y=impressions[0],
hovertemplate='Day: %{x} <br>Not clicked: %{y}', marker_color='lightsalmon')
])
# Change the bar mode
fig.update_layout(
title = 'Daily Impressions based on Clicks',
xaxis_title = "Day of week",
yaxis_title = "Impressions / day",
barmode='group',
)
fig.show()
# Count impressions (all rows) and clicks (sum of the 0/1 flag) per weekday in one pass,
# so the two columns stay aligned even if some day has no clicks
daily_ctr = df.groupby("day_of_week")["click"].agg(impressions="count", clicks="sum").reset_index()
daily_ctr["CTR"] = daily_ctr["clicks"] / daily_ctr["impressions"] * 100
daily_ctr['day_of_week'] = pd.Categorical(daily_ctr['day_of_week'], categories=days, ordered=True)
daily_ctr = daily_ctr.sort_values('day_of_week')
daily_ctr.head(7)
fig = px.bar(daily_ctr, x='day_of_week', y='CTR',
labels={"day_of_week": "Day"},
color='CTR',
height=400)
fig.update_layout(
title = 'Daily CTR',
xaxis_title = "Day of week",
yaxis_title = "Click-Through Rate (CTR)",
)
fig.show()
Our daily CTR graph shows that the chance of an ad being clicked is higher on Saturday and Sunday. This is plausible, since on weekends people have more time to spend online, come across ads and click them.
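The weekday label and its Monday-to-Sunday ordering can also be built in one step. The sketch below is one possible variant on toy data (the rows are illustrative): `dt.day_name()` gives the English weekday name, and an ordered categorical makes the grouped result come out in calendar order without a separate sort:

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

toy = pd.DataFrame({
    "hour": pd.to_datetime(["2014-10-25 10:00", "2014-10-21 10:00", "2014-10-26 10:00"]),
    "click": [1, 0, 1],
})

# Weekday name per row, stored as an ordered categorical
toy["day_of_week"] = pd.Categorical(toy["hour"].dt.day_name(), categories=days, ordered=True)

# observed=True keeps only the weekdays that actually occur in the data
daily = toy.groupby("day_of_week", observed=True)["click"].agg(impressions="count", clicks="sum")
print(daily)
```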
Now that we have understood how clicks vary with the hour of the day and the day of the week, let us understand the effect of other variables on the target click.
The kind of website hosting our ads has a huge impact on our clicks. Is the website a famous one, a commercial e-commerce one, or just a blogging site? Many site-related factors play an important role in whether a person will click an ad rendered on it. As calculated earlier, we have 1788 unique websites.
print("Number of unique websites: {}".format(str(len(df["site_id"].unique()))))
# Top 5 websites by number of ads displayed on them
siteids = df["site_id"].value_counts()[:5].index
print("Top5 websites based on impressions: \n{}".format(siteids))
top5_sites = df[df["site_id"].isin(siteids)]
top5_sites_click = top5_sites.groupby(['site_id', 'click']).size().unstack(fill_value=0).reset_index()
top5_sites_click = top5_sites_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_sites_click.columns.name = None
top5_sites_click = top5_sites_click.sort_values(by="Clicked", ascending=False).reset_index(drop=True)
# Derive impressions from each row's own counts: assigning value_counts() output by position
# would mismatch rows, because the table is sorted by clicks rather than by impressions
top5_sites_click["site_impressions"] = top5_sites_click["Clicked"] + top5_sites_click["Not Clicked"]
top5_sites_click.head()
fig = go.Figure(data=[
go.Bar(name='Clicked', x=top5_sites_click["site_id"], y=top5_sites_click["Clicked"],
hovertemplate='Site ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
go.Bar(name='Not Clicked', x=top5_sites_click["site_id"], y=top5_sites_click["Not Clicked"],
hovertemplate='Site ID: %{x} <br>Not clicked: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
title = 'Top5 Sites based on Clicks',
xaxis_title = "Top5 Site IDs",
yaxis_title = "Impressions / site",
barmode='group',
)
fig.show()
Of the 1788 sites on which our ads are placed, the chart shows the top 5 in terms of impressions. As before, many people see the ads but only a few of them end up clicking, as the green bars above show.
top5_sites_click['CTR'] = top5_sites_click['Clicked'] / top5_sites_click['site_impressions'] * 100
top5_sites_click.head()
fig = px.bar(top5_sites_click, x='site_id', y='CTR',
labels={"site_id": "Site Id"},
color='CTR',
height=400)
fig.update_layout(
title = 'CTR values of Top5 Sites',
xaxis_title = "Top5 Site Ids",
yaxis_title = "Click-Through Rate (CTR)",
)
fig.show()
We see that although site id 85f751fd had more impressions, site id 5b08c53b had a higher CTR. One possible explanation is that this site has keywords that closely describe the ads, and that clicking the ad directs the user to an appropriate landing page.
print("Number of unique domains: {}".format(str(len(df["site_domain"].unique()))))
# Top 5 domains by number of ads displayed on them
sitedomains = df["site_domain"].value_counts()[:5].index
print("Top5 site domains based on impressions: \n{}".format(sitedomains))
top5_domains = df[df["site_domain"].isin(sitedomains)]
top5_domains_click = top5_domains.groupby(['site_domain', 'click']).size().unstack(fill_value=0).reset_index()
top5_domains_click = top5_domains_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_domains_click.columns.name = None
top5_domains_click = top5_domains_click.sort_values(by="Clicked", ascending=False).reset_index(drop=True)
# Derive impressions from each row's own counts: assigning value_counts() output by position
# would mismatch rows, because the table is sorted by clicks rather than by impressions
top5_domains_click["domain_impressions"] = top5_domains_click["Clicked"] + top5_domains_click["Not Clicked"]
top5_domains_click.head()
fig = go.Figure(data=[
go.Bar(name='Clicked', x=top5_domains_click["site_domain"], y=top5_domains_click["Clicked"],
hovertemplate='Domain ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
go.Bar(name='Not Clicked', x=top5_domains_click["site_domain"], y=top5_domains_click["Not Clicked"],
hovertemplate='Domain ID: %{x} <br>Not clicked: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
title = 'Top5 Domains based on Clicks',
xaxis_title = "Top5 Site Domains",
yaxis_title = "Impressions / domain",
barmode='group',
)
fig.show()
Our websites are described by a domain. If a domain is descriptive and apt, then the chances of people visiting it are higher, although that does not guarantee they will click the ad. If the ad is not relevant to the site's content or core idea, it will have a lower CTR.
top5_domains_click['CTR'] = top5_domains_click['Clicked'] / top5_domains_click['domain_impressions'] * 100
top5_domains_click.head()
fig = px.bar(top5_domains_click, x='site_domain', y='CTR',
labels={"site_domain": "Domain ID"},
color='CTR',
height=400)
fig.update_layout(
title = 'CTR values of Top5 Domains',
xaxis_title = "Top5 Domains",
yaxis_title = "Click-Through Rate (CTR)"
)
fig.show()
Again, the 4th domain has a higher CTR although it had fewer impressions overall than the 1st domain.
print("Number of website categories: {}".format(str(len(df["site_category"].unique()))))
# Top 5 site categories by number of ads displayed in them
sitecategories = df["site_category"].value_counts()[:5].index
print("Top5 site categories based on impressions: \n{}".format(sitecategories))
top5_categories = df[df["site_category"].isin(sitecategories)]
top5_categories_click = top5_categories.groupby(['site_category', 'click']).size().unstack(fill_value=0).reset_index()
top5_categories_click = top5_categories_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_categories_click.columns.name = None
top5_categories_click = top5_categories_click.sort_values(by="Clicked", ascending=False).reset_index(drop=True)
# Derive impressions from each row's own counts: assigning value_counts() output by position
# would mismatch rows, because the table is sorted by clicks rather than by impressions
top5_categories_click["category_impressions"] = top5_categories_click["Clicked"] + top5_categories_click["Not Clicked"]
top5_categories_click.head()
fig = go.Figure(data=[
go.Bar(name='Clicked', x=top5_categories_click["site_category"], y=top5_categories_click["Clicked"],
hovertemplate='Category ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
go.Bar(name='Not Clicked', x=top5_categories_click["site_category"], y=top5_categories_click["Not Clicked"],
hovertemplate='Category ID: %{x} <br>Not clicked: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
title = 'Top5 Categories based on Clicks',
xaxis_title = "Top5 Site Categories",
yaxis_title = "Impressions / site category",
barmode='group',
)
fig.show()
Sites can belong to various categories: e-commerce websites, healthcare websites, education websites, etc. Each category contains websites from different domains. The above graph shows how impressions vary by site category. For instance, the 2nd site category has the highest impressions. It might represent e-commerce sites like Amazon or eBay, which have a larger footprint than a less-visited website such as a hospital website or an educational blog catered to a specific audience.
top5_categories_click['CTR'] = top5_categories_click['Clicked'] / top5_categories_click['category_impressions'] * 100
top5_categories_click.head()
fig = px.bar(top5_categories_click, x='site_category', y='CTR',
labels={"site_category": "Site Category ID"},
color='CTR',
height=400)
fig.update_layout(
title = 'CTR values of Top5 Site Categories',
xaxis_title = "Top5 Site Categories",
yaxis_title = "Click-Through Rate (CTR)"
)
fig.show()
As before, the CTR is higher for the 4th site category although its impressions are lower.
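The site, domain and category analyses all follow the same steps: pick the most-shown values of a column, count clicks and impressions for each, and divide. As a sketch, those steps could be collected into one function. `top_n_ctr` is a hypothetical helper introduced here for illustration, not part of the analysis above, and the toy data is made up:

```python
import pandas as pd

def top_n_ctr(data, column, n=5):
    """Hypothetical helper: impressions, clicks and CTR (%) for the n most-shown values of `column`."""
    summary = (
        data.groupby(column)["click"]
            .agg(impressions="count", clicks="sum")  # rows shown, rows clicked
            .nlargest(n, "impressions")              # keep the n most-shown values
            .reset_index()
    )
    summary["CTR"] = summary["clicks"] / summary["impressions"] * 100
    return summary

# Toy data (illustrative): site "a" is shown most, site "b" is clicked most often
toy = pd.DataFrame({
    "site_id": ["a", "a", "a", "a", "b", "b", "c"],
    "click":   [0,   0,   0,   1,   1,   1,   0],
})
result = top_n_ctr(toy, "site_id", n=2)
print(result)
```

Because impressions and clicks are computed in the same grouped pass, each row's CTR always refers to its own site, whichever column the table is later sorted by.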